External Lexical Information for Multilingual Part-of-Speech Tagging

Author

  • Benoît Sagot
Abstract

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages: two feature-based systems (MEMMs and CRFs) and two neural systems (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. However, our feature-based models perform better on lexically richer datasets (e.g. for morphologically rich languages), whereas the neural models achieve higher results on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.

Keywords: Part-of-Speech Tagging, Feature-based models, Neural models, MEMM, CRF, bi-LSTM, Multilingual Analysis
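To make concrete what "enriched with morphosyntactic lexicons" can mean for a feature-based tagger, here is a minimal sketch of per-token feature extraction that mixes word-form features with lookups in an external lexicon. The toy `LEXICON`, the function `token_features`, and all feature names are hypothetical and chosen for illustration only; they are not the feature set of MElt or of any system described in the abstract.

```python
from typing import Dict, List, Set

# Toy morphosyntactic lexicon: word form -> set of possible tags.
# These entries are illustrative assumptions, not data from the paper.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "can": {"AUX", "NOUN", "VERB"},
    "fish": {"NOUN", "VERB"},
}


def lexicon_tags(word: str, lexicon: Dict[str, Set[str]]) -> str:
    """Return the sorted tags the lexicon lists for a word, or '<unk>'."""
    tags = lexicon.get(word.lower())
    return "|".join(sorted(tags)) if tags else "<unk>"


def token_features(
    words: List[str], i: int, lexicon: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Build a feature dict for words[i] from its form and the lexicon."""
    word = words[i]
    lower = word.lower()
    feats = {
        # Form-based features, available without any external resource.
        "form": lower,
        "suffix3": lower[-3:],
        "is_capitalized": str(word[:1].isupper()),
        # Lexicon-based feature: the ambiguity class of the current word.
        "lex_tags": lexicon_tags(word, lexicon),
    }
    # Lexicon-based feature for the right neighbour, or an end marker.
    if i + 1 < len(words):
        feats["next_lex_tags"] = lexicon_tags(words[i + 1], lexicon)
    else:
        feats["next_lex_tags"] = "<eos>"
    return feats
```

Feature dicts of this shape could be fed to a MEMM or CRF implementation. The lexicon features supply tag information for words that are rare or absent in the training data, which is one plausible reason such resources help on lexically rich datasets.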

Similar resources

Evaluating the Impact of External Lexical Resources into a CRF-based Multiword Segmenter and Part-of-Speech Tagger

This paper evaluates the impact of external lexical resources on a CRF-based joint Multiword Segmenter and Part-of-Speech Tagger. We especially show different ways of integrating lexicon-based features in the tagging model. We display an absolute gain of 0.5% in terms of F-measure. Moreover, we show that the integration of lexicon-based features significantly compensates the use of a s...

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of sentences, producing a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline. Unfortunately, in pipel...

An Overview of Data-Driven Part-of-Speech Tagging

Over the last twenty years or so, approaches to part-of-speech tagging based on machine learning techniques have been developed or ported to provide high-accuracy morpho-lexical annotation for an increasing number of languages. Given the large number of morpho-lexical descriptors for a morphologically complex language, one has to consider ways to avoid the data sparseness threat in standard ...

Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: a Bayesian Non-Parametric Approach

We investigate the problem of unsupervised part-of-speech tagging when raw parallel data is available in a large number of languages. Patterns of ambiguity vary greatly across languages and therefore even unannotated multilingual data can serve as a learning signal. We propose a non-parametric Bayesian model that connects related tagging decisions across languages through the use of multilingua...

Part-of-Speech Tagging of Persian Using a Fuzzy Network Model

Part-of-speech tagging (POS tagging) is an active research topic in natural language processing (NLP). The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...

Journal title:
  • CoRR

Volume abs/1606.03676

Publication date 2016